Statistical Inference II

Chelsea Parlett-Pelleriti

Evaluating an Estimator

Review: Estimands, Estimator, Estimates

  • Estimand: a target quantity to be estimated

    • \(\mu\) mean height of all people named Michael in the US
  • Estimator: a function \(W(\mathbf{x})\): a recipe for how to get an estimate from a sample

    • \(\bar{x} = W(\mathbf{heights}) = \frac{1}{N} \sum_{i=1}^N height_i\)
  • Estimate: a realized value of \(W(\mathbf{x})\) applied to an actual sample, \(\mathbf{x}\)

    • \(\bar{x} = W(176,177,175,179,173) = 176\)
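The three pieces can be sketched in a few lines of R, reusing the heights from the example above (the estimator `W` here is just the sample mean written out explicitly):

```r
heights <- c(176, 177, 175, 179, 173)  # an actual sample

# the estimator W: a recipe for turning a sample into a number
W <- function(x) sum(x) / length(x)

# the estimate: W applied to this particular sample
W(heights)  # 176
```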

Evaluating an Estimator

The goal of an estimator is to estimate the estimand well.

This is important because it’s what allows us to make inferences about a Population statistic based on a Sample statistic. This is the core of inference. Good estimators will be:

  • Unbiased

  • Consistent

  • Efficient

Evaluating an Estimator: Unbiasedness

Intuitive Idea: our estimator \(\hat{\theta}\) should not systematically over- or under-estimate \(\theta\)

(figure: bias and variance illustration from Scott Fortmann-Roe)

Evaluating an Estimator: Unbiasedness

Math Idea: an estimator \(\hat{\theta}\) is unbiased if

\[ \mathbb{E}(\hat{\theta}) = \theta \]

Evaluating an Estimator: Unbiasedness

Sample Mean Estimator

\[ \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{X}_i \]

\[ \mathbb{E}[\hat{\mu}] = \mathbb{E} \left [ \frac{1}{n}\sum_{i=1}^{n}\mathbf{X}_i\right] = \frac{1}{n}\sum_{i=1}^{n}\mathbb{E} \left [ \mathbf{X}_i\right] = \mu \]

Thus, the sample mean, \(\hat{\mu}\) is an unbiased estimator of the population mean \(\mu\)

Note we’re using \(\hat{}\) in this section (pronounced “hat”) for consistency. Common estimators like the sample mean often have their own symbols like \(\bar{x}\) which you’ll also see used. In general, putting a \(\hat{}\) on something means that we’re creating an estimate of it.
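This unbiasedness claim is easy to check by simulation. A minimal sketch (the population mean, sd, and sample size are made-up values for illustration): averaging many sample means lands very close to \(\mu\).

```r
set.seed(42)
mu <- 176  # true population mean (assumed for this simulation)

# take 10,000 samples of size 5 and compute the mean of each
sample_means <- replicate(10000, mean(rnorm(n = 5, mean = mu, sd = 3)))

# the average of the sample means sits right on top of mu
mean(sample_means)  # approximately 176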

Evaluating an Estimator: Unbiasedness

Population Variance = \(\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2\)

Sample Variance = \(\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\)

Sample Variance

# Function to demonstrate bias in sample variance estimation
demonstrate_variance_bias <- function(n = 10, true_mean = 0, true_var = 1, n_sims = 1000) {
  # Initialize vectors to store estimates
  var_biased <- numeric(n_sims)
  var_unbiased <- numeric(n_sims)
  
  # Run simulations
  for (i in 1:n_sims) {
    # Generate random normal data
    x <- rnorm(n, mean = true_mean, sd = sqrt(true_var))
    # Biased estimator: divide by n
    var_biased[i] <- sum((x - mean(x))^2) / n
    # Unbiased estimator: divide by n-1 
    var_unbiased[i] <- sum((x - mean(x))^2) / (n-1)
  }
  return(data.frame(biased = var_biased, unbiased = var_unbiased))
}
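Running a simulation like the one above (condensed here so it stands alone, with \(\sigma^2 = 1\)) shows the divide-by-\(n\) estimator averaging near \(\frac{n-1}{n}\sigma^2 = 0.9\) rather than \(1\):

```r
set.seed(1)
n <- 10; n_sims <- 100000

# sum of squared deviations from the sample mean, for many samples
ss <- replicate(n_sims, {x <- rnorm(n); sum((x - mean(x))^2)})

mean(ss / n)        # ~ 0.9: biased, underestimates sigma^2 = 1
mean(ss / (n - 1))  # ~ 1.0: unbiased
```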

Sample Variance

Sample Variance

\[ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \left( \mathbf{X}_i - \hat{\mu}\right)^2 \]

\[ \mathbb{E}[\hat{\sigma}^2] = \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n \left( \mathbf{X}_i - \hat{\mu}\right)^2 \right] = \frac{n-1}{n}\sigma^2 \]

Thus, the sample variance, \(\hat{\sigma}^2\), is a biased estimator of the population variance \(\sigma^2\)

Note: this is why, when calculating the sample variance, we divide by \(n-1\) instead of \(n\). Intuitively this makes sense: when we estimate the sample mean \(\hat{\mu}\), we lose 1 degree of freedom, and we then use that estimate to compute \(\hat{\sigma}^2\)
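R's built-in `var()` already applies this \(n-1\) correction (a quick check on a tiny toy vector):

```r
x <- c(2, 4, 6)

# var() divides by n - 1 (Bessel's correction), not n
var(x)                                  # 4
sum((x - mean(x))^2) / (length(x) - 1)  # 4, the same computation by hand
```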

Evaluating an Estimator: Consistency

Intuitive Idea: as we collect more data (information) the estimator should approximate the estimand more closely.

If we could have \(\infty\) information, our estimator should spit out estimates equal to the estimand.

Evaluating an Estimator: Consistency

Math Idea:

\[ \lim_{n \to \infty} \hat{\theta} = \theta \]

Example:

Sample Mean: The Law of Large Numbers guarantees that:

\[ \lim_{n \to \infty} \hat{\mu} = \mu \]
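A quick simulated illustration of this convergence (the distribution, \(\mu = 5\), and \(\sigma = 2\) are assumed values): the running mean settles onto \(\mu\) as \(n\) grows.

```r
set.seed(2)
x <- rnorm(100000, mean = 5, sd = 2)

# running estimate of mu after each new observation
running_mean <- cumsum(x) / seq_along(x)

running_mean[10]      # still noisy with little data
running_mean[100000]  # very close to 5
```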

Evaluating an Estimator: Consistency

Weak Law of Large Numbers

Intuitive Idea: as you get more and more independent, random samples of \(\mathbf{X}\), the sample mean of \(\mathbf{X}\) will get closer and closer to (and eventually converge on) its expected value.

Math Idea: for all \(\epsilon > 0\), if \(\sigma^2 < \infty\)

\[ \lim_{n \to \infty} P(|\bar{X_n} - \mu| < \epsilon) = 1 \]

Evaluating an Estimator: Consistency

Weak Law of Large Numbers Proof

For a random variable \(X\) with finite variance \(\sigma^2\) and expected value \(\mu\)

\[ P(|\bar{X_n} - \mu| \geq \epsilon) \leq \frac{Var(\bar{X_n})}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \]

As \(n \to \infty\), \(\frac{\sigma^2}{n\epsilon^2} \to 0\). So the probability that \(|\bar{X_n} - \mu| \geq \epsilon\) goes to \(0\). Thus, \(P(|\bar{X_n} - \mu| < \epsilon) \to 1\)

Note: we’re using Chebyshev’s Inequality here, which states that \(P(|X-\mu| \geq k) \leq \frac{\sigma^2}{k^2}\)
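The inequality in the proof is easy to sanity-check numerically. A sketch with assumed values (normal data, so the Chebyshev bound is quite loose, but it always holds):

```r
set.seed(5)
n <- 25; sigma <- 1; eps <- 0.5

# empirical P(|xbar - mu| >= eps) across many samples, vs the Chebyshev bound
xbars <- replicate(10000, mean(rnorm(n, mean = 0, sd = sigma)))

mean(abs(xbars) >= eps)  # much smaller than the bound
sigma^2 / (n * eps^2)    # the bound: 0.16
```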

Evaluating an Estimator: Efficiency

Intuitive Idea: the estimate we get should have the smallest variance possible (so that we can be more confident about our estimate with as little data as possible)

Evaluating an Estimator: Efficiency

Math Idea:

\[ Var(\hat{\theta}) \geq \frac{1}{I(\theta)} \]

where \(I(\theta)\) is the Fisher Information for \(\theta\). This is the Cramér-Rao lower bound: an unbiased estimator is efficient if its variance achieves it.

Evaluating an Estimator: Efficiency

Fisher Information

Intuitive Idea: the amount of information that a sample from a random variable \(\mathbf{X}\) can give us about a parameter \(\theta\)

  • Imagine that everyone in Room A has the same number of cats (\(\mu_A\))

  • Imagine that cat ownership in Room B is defined as \(Pois(\mu_B)\) where \(\mu_B\) is the mean number of cats owned in Room B

In which Room do I learn more about \(\mu\) by asking one person how many cats they own?

Evaluating an Estimator: Efficiency

Fisher Information

Intuitive Idea: the amount of information that a sample from a random variable \(\mathbf{X}\) can give us about a parameter \(\theta\)

  • Imagine that Room A has \(\text{height}_{cm} \sim \mathcal{N}(\mu_A, 8)\)

  • Imagine that Room B has \(\text{height}_{cm} \sim \mathcal{N}(\mu_B, 1)\)

In which Room do I learn more about \(\mu\) by asking one person their height?

Evaluating an Estimator: Efficiency

Fisher Information

Slightly Mathy Idea: Fisher Information measures how sensitive the log-likelihood function \(\ell(\theta | X)\) is to changes in \(\theta\) (more sensitive \(\to\) more information)

Evaluating an Estimator: Efficiency

Fisher Information

Math Idea:

\[ I_X(\theta) = -\mathbb{E}\left[\frac{\partial^2 \ell(\theta | X)}{\partial\theta^2} \right] \]

where \(\ell(\theta | X)\) is the log-likelihood of \(\theta\) given \(X\). If \(\ell\) is sensitive to changes in \(\theta\), the second derivative should be large and we expect to see high information

Note: \(\ell\) is usually concave down around the maximum likelihood estimate, meaning the second derivative will be negative; hence the negative sign in front of the expectation
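For the normal-mean case in the heights example, the second derivative of \(\ell(\mu | x)\) is exactly \(-1/\sigma^2\) for every \(x\), so one observation carries information \(1/\sigma^2\). A numerical sketch of that fact using finite differences (the values 170 and the step size are arbitrary assumptions; this is an illustration, not a general-purpose recipe):

```r
# numerically take -d^2/dmu^2 of the log-likelihood of one N(mu, sigma^2) draw
fisher_info_normal <- function(sigma, x = 170, mu = 170, h = 1e-3) {
  ll <- function(m) dnorm(x, mean = m, sd = sigma, log = TRUE)
  -(ll(mu + h) - 2 * ll(mu) + ll(mu - h)) / h^2
}

fisher_info_normal(sigma = 8)  # 1/64: Room A, wide spread, little info per person
fisher_info_normal(sigma = 1)  # 1:    Room B, tight spread, much more info
```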

Estimators Wrap-Up

  • Estimators take in sample data and produce a sample estimate

  • Good estimators produce estimates that allow us to make inferences about population parameters

    • Unbiased, Consistent, Efficient
  • These estimates are (so far) individual numbers that are guesses for population parameters

Point Estimates vs. Interval Estimates

  • Point Estimate: a single value calculated based on a sample that estimates a population parameter

  • Interval Estimate: a range of values calculated based on a sample that estimate a population parameter with uncertainty

E.g. the mean height of Michaels is 178cm vs. the mean height of Michaels is between 175cm and 181cm
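One standard way to get both kinds of estimate from a sample in R (the heights below are an assumed sample of Michaels; `t.test()` defaults to a 95% interval):

```r
heights <- c(176, 177, 175, 179, 173)  # assumed sample of Michaels

fit <- t.test(heights)
fit$estimate  # point estimate: 176
fit$conf.int  # interval estimate: roughly [173.2, 178.8]
```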

Point Estimates vs. Interval Estimates

Think about the research or industry work you’ve done. When would interval estimates have been helpful?

Point Estimates vs. Interval Estimates

Think about the research or industry work you’ve done. When would interval estimates have been helpful?

My Story: the clients who didn’t report uncertainty…

Inference: Second Problem

In the last class, we talked about the first problem of inference: data is too complex to reason about, we need summaries. But now that we’ve exhaustively discussed the problem of point estimates, we run into the second problem of inference…

Uncertainty.

Inference: Second Problem

Claim: My mean crossword time is faster than yours.

  • my time: 25m 05s
  • your time: 25m 23s

Is this 👆 enough to convince you that my mean time is faster than yours? Why/Why not?

Inference: Second Problem

Pro: 🤷‍♀️ the sample mean is an unbiased estimate

Inference: Second Problem

First problem: we need some more data…

set.seed(540)
n <- 1000
# generate fake crossword data
me <- rnorm(n, mean = 25, sd = 3)
you <- rnorm(n, mean = 28, sd = 3)

df <- data.frame(source = sort(rep(c("me", "you"), n)),
                 scores = c(me, you))

Now that we have more data, is this 👆 enough to convince you that my mean time is faster than yours? Why/Why not?

Quantifying Uncertainty

Frequentist Statistics

Frequentism main Ideas:

  1. data (\(X\)) is a random sample of our process \(P_{\theta}\), the parameters (\(\theta\)) are fixed

    • we imagine different samples that could exist
  2. inference relies on the idea of repeated samples of \(X\) from \(P_{\theta}\)

  3. probabilities are the long run frequency of an event

\[ p = \lim_{n \to \infty} \frac{k}{n} \]

Bayesian Statistics

Bayesianism main Ideas:

  1. data \(X\) is fixed, and the parameters \(\theta\) of our process \(P_{\theta}\) are random

    • we imagine different parameter values that could exist
  2. inference relies on the idea of updating prior beliefs based on evidence from the data

  3. probabilities are used to quantify uncertainty we have about parameters

\[ \underbrace{p(\theta|d)}_\text{posterior} = \underbrace{\frac{p(d|\theta)}{p(d)}}_\text{update} \times \underbrace{p(\theta)}_\text{prior} \]

Bayesian vs. Frequentist Probability

In the past month, it’s rained on 9 of the 30 days. 🌧️

  • Frequentist: the probability of rain is \(q = \frac{9}{30} = 0.3\)

  • Bayesian: before seeing the data, values of \(q\) near \(0.1\) sounded the most reasonable based on my knowledge of California. After seeing the data, I think the probability of rain \(q\) is most likely \(0.25\) but there’s a lot of uncertainty.
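The Bayesian version of the rain example can be made concrete with a conjugate Beta prior (the Beta(1, 9) prior here is an assumption chosen to put prior belief near \(0.1\)):

```r
a <- 1; b <- 9         # prior Beta(1, 9): prior mean 1/10 = 0.1
rainy <- 9; dry <- 21  # the data: 9 rainy days out of 30

post_a <- a + rainy    # conjugate update: add successes ...
post_b <- b + dry      # ... and failures to the prior counts

post_a / (post_a + post_b)  # posterior mean: 10/40 = 0.25
```

The whole posterior Beta(10, 30) quantifies the remaining uncertainty, not just its mean.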

Frequentist Uncertainty

In a frequentist analysis, uncertainty represents sampling variability: the difference between estimates on repeated, similar samples.

💡 if i took a bunch of random samples exactly like this one, how variable would my estimates be?

Frequentist Uncertainty

If I take repeated random samples of 2 people from my list of J’s, what are the different mean ages that I get?

(hint: mean(sample(c(17,24,21,19,25,28,23,20,23,25), size = 2, replace = TRUE)))

idx Name Age
1 John 17
2 James 24
3 Jane 21
4 June 19
5 Joachim 25
6 Jess 28
7 Javier 23
8 Jaques 20
9 Julie 23
10 Jackson 25

Frequentist Uncertainty

If I take repeated random samples of 8 people from my list of J’s, what are the different mean ages that I get?

(hint: mean(sample(c(17,24,21,19,25,28,23,20,23,25), size = 8, replace = TRUE)))

idx Name Age
1 John 17
2 James 24
3 Jane 21
4 June 19
5 Joachim 25
6 Jess 28
7 Javier 23
8 Jaques 20
9 Julie 23
10 Jackson 25

Frequentist Uncertainty

Remember, our sample mean \(\bar{x}\) is our estimate of our population mean \(\mu\). In which case are the sample means we get most certain?

Frequentist Uncertainty

Let’s say income of Chapman workers is \(\text{income} \sim gamma(0.6, 100000)\)

❓if I sample 2 people, how likely is it that I’ll get a sample mean near $500,000

❓if I sample 2,000 people, how likely is it that I’ll get a sample mean near $500,000
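These questions can be answered by simulation. A sketch (reading \(gamma(0.6, 100000)\) as shape 0.6 and scale 100000, so mean income is $60,000; the $200,000 threshold below is an assumed stand-in for "far above the mean", and $500,000 is rarer still):

```r
set.seed(8)
shape <- 0.6; scale <- 1e5  # mean income = shape * scale = 60000

# sampling distribution of the mean income for small vs large samples
mean_n2    <- replicate(10000, mean(rgamma(2,    shape, scale = scale)))
mean_n2000 <- replicate(10000, mean(rgamma(2000, shape, scale = scale)))

mean(mean_n2    >= 2e5)  # a few percent: tiny samples stray far from the mean
mean(mean_n2000 >= 2e5)  # 0: large-sample means stay near 60000
```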

Frequentist Uncertainty

  1. The more data we have, the more certain we are about our estimates

  2. The more data we have, the less likely we are to get a sample made up entirely of extreme values

Sampling Distributions

Sampling Distribution: the theoretical distribution of all possible estimates that result from taking a sample of size \(n\) from \(P_{\theta}\)

e.g. Sampling Distribution of \(\bar{x}\), Sampling Distribution of \(\hat{\sigma}\), Sampling Distribution of \(\hat{q}\)

Sampling Distributions: Example

Sampling Distribution of \(q\)

coin_flips <- sample(0:1, size = 100, replace = TRUE) # heads = 1
mean(coin_flips) # proportion of heads
[1] 0.53

Everyone run this code 10 times, and put the proportions of heads in this sheet.

Sampling Distributions: Example

Sampling Distribution of \(q\)

How much uncertainty do we have about \(\hat{q}\)?

Sampling Distributions: Example

Sampling Distribution of \(q\)

  • What is the range of \(\hat{q}\)s that cover 90% of samples?
  • What is our best guess for what \(\hat{q}\) is?

Sampling Distributions: Example

Sampling Distribution of \(q\)

  • What is the range of \(\hat{q}\)s that cover 90% of samples?
# simulated sampling dist
coin_flips <- replicate(10000, mean(sample(0:1, size = 100, replace = TRUE)))

# calculate 5th and 95th percentile
ci <- quantile(coin_flips, c(0.05,0.95))
# plot
library(ggplot2)
ggplot(data = data.frame(x = coin_flips),
       aes(x = x)) + geom_histogram(binwidth = 0.02, fill = "blue", color = "darkgray") + 
  xlim(c(0.2,0.8)) + 
  geom_segment(x = ci[[1]],
               xend = ci[[2]],
               y = -1,
               yend = -1,
               linewidth = 2) + 
  labs(x = expression(hat(q)),
       y = "",
       title = "Sampling Distribution of Sample Prop")

Sampling Distributions: Example

Sampling Distribution of \(q\)

  • What is our best guess for what \(\hat{q}\) is?
# simulated sampling dist
coin_flips <- replicate(10000, mean(sample(0:1, size = 100, replace = TRUE)))

# calculate mean
mu <- mean(coin_flips)

# plot
library(ggplot2)
ggplot(data = data.frame(x = coin_flips),
       aes(x = x)) + geom_histogram(binwidth = 0.02, fill = "blue", color = "darkgray") + 
  xlim(c(0.2,0.8)) + 
  geom_vline(xintercept = mu,
             linewidth = 2) + 
  labs(x = expression(hat(q)),
       y = "",
       title = "Sampling Distribution of Sample Prop")

Sampling Distribution: Analytical

So far, we used Monte Carlo simulations to approximate the sampling distribution. But, often we can directly calculate it instead.



Claim: Sampling Distributions of sample means will (often) be a Normal Distribution, and we can use what we know about Normal Distributions to calculate our point estimate (best guess) and our interval estimates (uncertainty).

Review: LLN

For a random variable \(X\) with finite variance \(\sigma^2\) and expected value \(\mu\)

\[ P(|\bar{X_n} - \mu| \geq \epsilon) \leq \frac{Var(\bar{X_n})}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \]

As \(n \to \infty\), \(\frac{\sigma^2}{n\epsilon^2} \to 0\). So the probability that \(|\bar{X_n} - \mu| \geq \epsilon\) goes to \(0\). Thus, \(P(|\bar{X_n} - \mu| < \epsilon) \to 1\)


💡 In other words, \(\bar{X}_n \to \mu\), as \(n \to \infty\). The larger our sample, the more concentrated the sampling distribution will be around \(\mu\)

Central Limit Theorem

Note: this is the Central Limit Theorem:

Let \(\mathbf{X}\) be a random variable with finite mean \(\mu\) and finite variance \(\sigma^2\). As \(n \to \infty\), the distribution of sample means approaches a normal distribution:

\[ \bar{x} \sim \mathcal{N}\left( \mu, \frac{\sigma^2}{n}\right) \]

Central Limit Theorem

Why is this useful?

  • we know a lot about the normal distribution, no more simulating to find out the point/interval estimate, we can calculate it directly!

  • even if the population data is not normally-distributed, the sampling distribution of \(\bar{x}\) will be (approximately, for large enough \(n\))!
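That second point is worth checking with a very non-normal population. A sketch using an exponential distribution with rate 1 (so \(\mu = 1\) and \(\sigma = 1\) are known, and the CLT predicts \(\bar{x} \sim \mathcal{N}(1, 1/n)\)):

```r
set.seed(3)
n <- 50

# sample means from a strongly right-skewed population
means <- replicate(10000, mean(rexp(n, rate = 1)))

mean(means)  # ~ 1:     matches the CLT mean, mu
sd(means)    # ~ 0.141: matches the CLT sd, sigma / sqrt(n) = 1 / sqrt(50)
```

A histogram of `means` looks bell-shaped even though the population is anything but.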

Sampling Distribution: Analytical

\[ \hat{q} \sim \mathcal{N}\left( q, \frac{q(1-q)}{n}\right) \]

where \(q\) is the true proportion and \(q(1-q)\) is the variance of a Bernoulli variable (in practice, we plug in our estimate \(\hat{q}\) for \(q\))

Sampling Distribution: Analytical

Now it’s easy to calculate the point estimate, and any interval estimate we want!

q <- mean(coin_flips)
n <- 100 # sample_size

# middle 90% of estimates
qnorm(c(0.05,0.95), mean = q,
      sd = sqrt((q*(1-q))/n))
[1] 0.4169164 0.5814016

Preview:

What can we do with these point estimates + uncertainty?

  • use them to estimate the values of our parameters \(\theta\)

    • my mean crossword time is 25.02 minutes (90% interval: [22.2, 28.2])
  • use them to make decisions

    • my mean crossword time is lower than 28 minutes